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Motivated by empirical observations of algebraic duplicated sequence length distributions in a 
broad range of natural genomes, we analytically formulate and solve a class of simple discrete dupli- 
cation/substitution models that generate steady-states sharing this property. Continuum equations 
are derived for arbitrary time-independent duplication length source distribution, a limit that we 
show can be mapped directly onto certain fragmentation models that have been intensively studied 
by physicists in recent years. Quantitative agreement with simulation is demonstrated. These mod- 
els account for the algebraic form and exponent of naturally occuring duplication length distributions 
without the need for fine-tuning of parameters. 



PACS numbers: 

A century has elapsed since the earliest reports of the 
evolutionary impact of gene duplication [l|. At the time 
there existed only a macroscopic and phenomenological 
conception of genetic material, but within the last decade 
static characterization of the finest details of the latter 
has become routine. Its dynamics, on the other hand, re- 
mains for the most part only indirectly accessible; 'snap- 
shots 'of complete individual genomes at short time inter- 
vals are not yet practical, and dynamics must be inferred 
from their cumulative effect on representative genome se- 
quences. 

This dynamics is important because to a good approx- 
imation genome sequence determines, via natural selec- 
tion, the fates of individuals and of species - but our un- 
derstanding of how this happens is primitive. Contempo- 
rary lineages are our primary source of genome sequence, 
making it difficult to associate the presence or absence 
of most genomic features with their effects, if any, on an 
individual. Indeed, a primary goal of modern genomics 
is to determine if, when, and on what time scales the 
sequence evolution reflects selection. 

Neutral models of sequence evolution - sequence dy- 
namics that, on the time scales of interest, are indepen- 
dent of selection - underlie all methods that we know 
of to achieve this goal[2|. When sequences common to 
two different organisms, or that appear multiply within 
the same genome, exhibit identity exceeds, (falls short 
of) that expected on given model of neutral evolution, it 
is taken as evidence either that negative ( positive) se- 
lection is acting on these sequences, or that the neutral 
model is flawed. As a given sequence fragment has some 
chance of exhibiting any excess or shortfall of identity 
within a neutral model, selection is inferred probablis- 
tically by studying frequencies of the levels of sequence 
identity within or between genomes[2|. Thus, length dis- 
tributions of similar or identical sequences have tradition- 
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ally played a fundamental role in genomics and molecular 
genetics, and our interpretation of genomic sequence re- 
lies upon our understanding of these distributions. 

The topic of this manuscript originates in an empir- 
ical observation of algebraic duplicated-sequence length 
distributions in a broad range of natural genomes |3H6|- 
We present several models of neutral evolution of chro- 
mosomes. Nominally neutral processes, such as dupli- 
cation or point mutation, are ascribed critical roles in 
genome evolution [7[. In [8[ an empirical model of du- 
plication was proposed that accounted for the observed 
algebraic distribution of duplicated sequence lengths in 
natural genomes, but relied on tuning the length distri- 
bution of the duplication source. Here we formulate and 
solve an alternative model for which no such tuning is 
required. 

The action of duplication is to copy a sequence frag- 
ment and subsequently to insert it, or to substitute 
it for a same-sized sequence fragment, elsewhere in a 
chromosome 9] . As there are more complicated processes 
in genomes with similar macroscopic effects, this is just 
an idealization. Standard models of sequence evolution 
incorporate random, uncorrelated base substitution. 

In the following we describe explicit models for 
this process, derive corresponding equations, solve 
them, compare the exact dynamics with the empirical 
one for certain sources investigated previously[8| and 
present a continuum generalization of the dynamics that 
maps directly onto fragmentation processes from kinetic 
theory [13 - 

A chromosome consists of a string of L bases chosen 
from a finite alphabet; in natural genomes the alphabet 
typically consists of the four bases A,G,C, and T; for sim- 
plicity and without loss of generality we use here a two 
base alphabet. The fundamental sequence element that 
we study is the the set of repeated sequences within the 
chromosome, counted in an algorithm-independent way. 
Specifically, we study 'supermaximal repeats' (or 'super 
maxmers'): sequence duplications neither copy of which s 
contained in any longer sequence duplication within the 



chromosome p, |8|. From now on, we refer to a super- 
maxmer of length m simply as an m-mer. 

Within our models one duplication occurs at each dis- 
crete time step t = 1,2,...: namely, a subsequence of 
the length m is chosen randomly within the chromosome 
according to a predetermined source distribution P(m) 
and is susbtitutcd for a sequence of length m at another 
randomly chosen position in the chromosome. It was 
numerically demonstrated [8j that for monoscale sources 
and certain power-law source distributions, the duplica- 
tion length distribution attained a stationary state at 
long times. 

We denote the ensemble-averaged number of m-mers 
at time t as f(t,m). For a monoscale source, we ex- 
pect at stationarity that the f(t,m) will decay rapidly 
for to > D. There are two processes altering the num- 
ber of 777-mers: a new duplication of fixed length D can 
fragment an existing m-mer or generate a new m-mers 
by fragmenting a longer m-mer; processes of higher or- 
der in D/L are ignored, where D, L >> 1 but D << L. 
Finally, we assume that at each time step one new Z3-mer 
is introduced by the source. 

The time dependence of / can be represented by terms 
of the form uf(t, m), u being a coefficient describing the 
rate of creation or destruction of corresponding m-mers. 
An m-mer is annihilated when a newly created duplica- 
tion of length D overlaps with one of the sequences com- 
posing the m-mer; the rate of this event is 2(D+m— 1)/L. 
Alternatively an m-mer may be created when a newly 
created duplication overlaps with an 777, + k-mer, k > 0. 
The probability that a 777 + fc-mer produces an m-mer 
is 4a/ L, where a has the unit of a single base. The 
monoscale source is represented by the Kronecker delta 
5 C (D — 777). Our expressions are valid when D << L; 

Supermaxmers may also be annihilated by base sub- 
stitution. Substitution occurs with rate /i per time unit 
per unit length (in bases); duplication with rate /3, mea- 
sured in 1/time unit. Then the balance equation takes 
the form 



f(t+l,m)-f(t,m) 



D 



-j3 + /i777 



f(t,m)- 



-a/3 + 4a/j, 



£; f(t,k) + 2pS c (D-m). (1) 
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The solution of (UJ converges to a stationary solution. 
To see this, take [/, = 0, j3 = 1, a = 1 [base], and note 
that ([1]) may be represented in matrix form as f(t + 1) = 
Af(t) + S c , where the matrix A is such that An = 2(1 — 
{D + 1)/L), Aij = 4/L for i < j, and A tJ = 0, for i > j, 
i,j = 1, 2, ... D; the vector 5 C is Sn = 2, Si = 0, i < D. 
The matrix A is upper triangular and its eigenvalues are 
readily computed yielding A& = l — j3(D+(k—l))/L, k = 
1,2, ... ,D. It is evident that < |Afc| < 1 for all k, as we 



assumed D << L and k — 1, 2, ... D, and consequently, 
the iteration is guaranteed to converge. For fi ^ the 
eigenvalues have the form Afe = 1 — (i m+ ^ 1 — fim, k — 
1,2, ... ,D, thus the requirement of the convergence to a 
stationary state |A| < 1 yields (approximately) D « L, 
/jAt < 1/D. 

If some initial state /(0) is given and if we denote by 
f s the limiting stationary state of the system, we can 
calculate f s as follows 



f a = lim /(£) = lim 

t— >GO t— ¥OQ 
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where A = TAT -1 and T diagonalizes A and consists of 
eigen vectors of A. 

As the eigenvalues are computed as above, the limiting 
state exists and does not depend on the initial condition. 
Further estimates show that f s = LTAT~ 1 S, where, e.g., 
for the case /j = we have Aj k = l/(D+k — l)), for j = k 
and Ajk = when j ^ k. We may write down the exact 
stationary solution of the equation ([lj in scalar form 



f{m,D,L,jj) 
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Comparisons to the empirical model [8( with ([3]) for 
/i / are presented in fig. [TJ It is evident that with 
increasing base substitution rate both simulations and 
the solution demonstrate power-law behavior with the 
exponent —3. To obtain this exponent from the solution 
([3]) observe that / in ([3]) is approximately represented as 



g"{x), where g(x) 
obtains 



(D - x)/{fiS±s. + MX ). Then one 
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For dynamics described by a power-law source 77(777) ~ 
I/777 7 the characteristic scale D is replaced by the first 
moment of the source. We also make all lengths dimen- 
sionless, dividing them by the length of 1 base a or by 
the first moment of the source E. Then we have for prob- 
abilities of fragmentation 



p(m) 



1 



1 



m 7 (C(7)-C(7,^+l)) mT^i( 7 ,JV)' 



where L = Na, C(t) i s the Riemann zeta-function, 
£(7, iV+1) is the generalized zeta-function. The equation 
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FIG. 1: Curves represent stationary states of the system de- 
scribed in [8| for various base substitution rates fj, and corre- 
sponding analytic solutions of the system (JTJ. The chromo- 
some length L = 10 7 ; source length D — 1024. Increasing 
base substitution rates exhibit a power-law tail with the ex- 
ponent —3. 



FIG. 2: Curves represent stationary states of the system de- 
scribed in [U] for various first moments E of the power-law 
source of the form p(x) ~ l/(mo + x 1 ) compared to the solu- 
tion of 0; L — 10 7 , n = 0, 7 = —2.4. Deviation of solution 
from simulations is observed for small E, when the condition 
E >> a is violated. For large lengths it is observed the regime 
with the slope 7 + 1 described in [8J]. 



for a finite-size system with a power-law source can be 
obtained similarly to that for the monoscale source and 
has the form 
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and m,r — 1,2, ... ,N. 

The equation ((4]) can be solved for a monoscale source. 
The matrix of the system is again upper-triangular. The 
eigenvalues are given by 



A fc = 1 - A- 



ka^N + {k - l)p + (j> 2 {l,N)P 

TV 



(5) 



where </> 2 (7, N) = ^1(7 - l,N)/^>i(j,N). The last rela- 
tions show how the source influences the eigenvalues and 
that this influence diminishes with increasing system size. 
Using the representation for the matrix A of the sys- 
tem ((U) A — TAT -1 , where A is a diagonal matrix and 
T is the matrix whose columns are eigen vectors of A, 
we obtain the solution of ((4]) in the same way as for ([2]); 
A has Afc from (J5J on the main diagonal and b describes 
the source. The stationary solution does not depend on 
initial conditions /(0) if |Afc| < 1, k = 1, 2, . . . , N. Con- 
sider the conditions that are then imposed the parame- 
ters. If [i, = 0, then 



Afc = I - Ar/3 



fc-l + 2 ( 7 ,AQ 
N 



It follows, in part, that 



(3At- 



l + 2 (7, N) 



N 



<2. 



The maximum value of k is N and the last inequality is 
valid for all N provided that 7 > 2 because in the discrete 
system we normally have At = 1 and /3 = 1. If /i 7^ 
then the last inequality is written as follows 



kafiAr + /3At- 



-I +/3 AtMZ^ <2 . 
N H N 

The last term is small; the middle term can be of the 
order of unity as k — > N and if Nafx ~ 1, assuming 
At = 1, then the inequality may be violated. Thus a 
rough condition for convergence to a stationary solution 
is L/zAr = 0(1). 

The structure of the source makes an analytic repre- 
sentation unwieldy; the solution of Q as t —> 00 was ob- 
tained numerically. The comparison of the solution with 
simulations is presented in fig. [2J Evidently equation ^ 
reproduces both exponent and amplitude. 

Finally, some continuum limits following from these 
discrete dynamics are obtained. As all lengths are mea- 
sured in bases we dimensionalize them as follows: a — 
a/E, fn = m/E, L = L/E, p, = E/j,. Then (JU with an 
arbitrary source P(fh) takes the form 



Af(t,m) 
At 



= -2 
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FIG. 3: Curves represent stationary states of the system de- 
scribed in the Letter for L = 10 7 and varying E and /j com- 
pared to the solution of Q. These are shown the regime 
(X « E corresponding to the continuum equation and the 
regime with small mutation rate \x studied in [8J], 7 = —3.0. 
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FIG. 4: Neither natural chromosomes were repeatmasked. C. 
elegans chr. 1 was translated to a 2 base alphabet (G— >A; 
C— »T). The simulation was according to the model in [g] with 
ju = 



The equation is dimensionless with respect to length. 
Taking a — \ and introducing densities / = af, P = ap 
with L = Na, fh = ka, the equation becomes 



A/(t,m) 
At 



1 + ka- 



4 



+ 4£i 



Na 

N 
k=m-\-l 



-ft + {iak 



f+ 



2/3p(m) 



(7) 



As a — >• we assume a << E, and as /i diminishes with 
a, we take E 1 ss /i _1 so that /U ~ 1. However it is impor- 
tant to stress that as E appears in both substitution and 
duplication terms of the equation, to achieve indepen- 
dent variation of orders of these terms it is necessary to 
increase \i for fixed E; the order of the substitution terms 
may be increased while the order of other terms does not 
vary. We keep the product ka finite and denote it by x; 
this implies k ~ a~ 1 and corresponds to an intermediate 
regime[12] for ©. Then the left-hand side, the source 
term and both terms containing mutation rate ft have 
the order ~ 1. Further details of the behaviour depend 
on the orders of other terms. Taking TV ~ a~ a we see 
that the next order terms are ~ a a , this regime cor- 
responds to L — > oo and L » E. Thus, the equation 
takes the form 



-± = -2xjif + 4ji / f(t,y)dy + 2(3p(x)+0(a a - 1 ). 



' ( 1L 
dt 

The main order regime corresponds to fragmentation 
with input studied in [lOj. The coefficients of the ne- 
glected terms of the leading order are —2(1 + x)/3 and 
4/3 J fdy; these terms are responsible for duplications. 



For the numerical simulations in fig. [T]wc set L = 10 7 , 
E = D « 10 3 and vary fx; thus, L » E. As fi ap- 
proaches 10 3 the output distribution approaches an alge- 
braic form with exponent —3, corresponding to the so- 
lution of the fragmentation equation for a monodisperse 



source 



lfj,[ll|. Fig. [5] corresponds to the regime with 
vanishing fj, and is not described by the equation; fig. 
[3] demonstrates various regimes and slope —3 which is 
observed for E » a and Ep, ~ 1. 

From (0 it also follows that the duplicates are dilute 
and do not collide. In this sense each m-mer evolves 
independently of other m-mers. Following Ben-Naim[10| 
we can estimate the fragment distribution as follows: 

Consider a duplication of length £ in a dilute system 
undergoing base substitution. Duplicates are broken into 
fragments whose number M varies in time asM = Eyd, so 
that the average fragment length at time t is I = E/M. 
Consecutive mutations are independent, hence the dis- 
tribution of fragment lengths is geometric and close to 
Poisson for a, << 1, i.e., p(l) ~ l^ 1 cxp(—l/l); in this 
normalization we neglect terms exponentially decaying 
in E. Then the number n(t, I) of fragments of length I 
is Mp(l). To compute the number of such fragments as 
t — > oo we need to integrate over time as new fragments 
appear continuously, obtaining 



n{l) 



Mj3p(l)dt 



I A" 3 



In fig. 2] we provide comparison of natural data with 
our simulations. 

The dynamics we presented reproduces the regime typ- 
ically observed in real genomes. Natural genomes demon- 
strate —3 slope of length distributions for many species 
[5|. It is important to stress that this regime is repro- 
duced by our model [8| as well as by the model suggested 



in the paper. However, unlike [8] the state of the model 
can be done independent of the source of duplications. 
Thus the state of many currently studied genomes may 
be understand as continuous interaction of point substi- 
tutions and segmental duplications which are generated 
by some source. These results also account for the ex- 
istence of very long identical sequences (e.g., ultracon- 
served elements) in terms of neutral evolution. On the 
other hand, the assumptions for ([7]) may be altered to ob- 
tain different regimes and thus, to include genomes whose 
state deviates from —3 regime because of, e.g, extensive 
recent duplications. 
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